#python #neural search #Jina #deep learning #clip #images

Image Encoders: BigTransfer vs CLIP

Published Sep 18, 2021 by Alex C-G

I’ve been mucking around with building a meme search engine using Jina. To do so I’m testing a couple of different image encoders.

Big Transfer encoder from Google
CLIP image encoder

In essence, these use a neural network to turn an image file into vector embeddings that can be compared for a similarity (“nearest neighbor”) search. Which one is best (at least for memes)? Let’s put them to the test. We’ll index 10,000 memes and compare:

Time it takes to index (in minutes) (note: this covers the running time of the whole Jina process, not just encoding)
Size of the index (in megabytes)
Quality of search results

BigTransfer

flow = (
    Flow()
    .add(
        uses="jinahub+docker://BigTransferEncoder",
        uses_with={"model_name": "Imagenet1k/R50x1", "model_path": "model"},
)

CLIP

flow = (
    Flow()
    .add(
        uses="jinahub+docker://CLIPImageEncoder",
)

Model/query image	CLIP	BigTransfer	Winner
Time to index (via `time python app.py index`)	3:24	1:33	BigTransfer
Index size (via `du -hs workspace`)	111 mb	113 mb	Too close to call
			🤷
			🤷
			🤷
			BigTransfer

So, what have we learned?

Not much really.

Accuracy: Both models perform almost the same in terms of quality, with BigTransfer edging out CLIP when it comes to catching the Sparta meme.
Performance: BigTransfer blows CLIP out of the water, taking less than half the time to index the dataset (and remember, that includes the time Jina takes to spin up).
Disk usage: Not much in it. What’s a couple of megs between friends?
Other resource usage: I didn’t measure memory/CPU usage this time round, but in my tests it felt like BigTransfer was a lot lighter. (Every time I used CLIP before my laptop would fall over without a swap file. That wasn’t the case with BigTransfer.)

This is just how well they perform on a folder of memes though - and memes with similar templates are very similar to each other (otherwise they wouldn’t really be memes, just random photos with some Impact font over the top). The models may perform very differently on a dataset of personal photos where variation would be greater.

Next time maybe I’ll test the meme dataset but with by searching variations of the doge meme. There are plenty of variations and I haven’t seen any of those in the dataset so far. So whichever model matches more closely with the classic Doge meme would be the winner.

Testing notes

I ran the tests on my Thinkpad X1 Carbon gen8 with a 10gb swapfile.
I randomly chose 10,000 memes to index from the imgflip dataset on Kaggle, using a random seed of 42.
All images were normalized to 96x96 resolution.

*****

Image Encoders: BigTransfer vs CLIP

BigTransfer

CLIP

🤷

🤷

🤷

So, what have we learned?

Testing notes